kvcoord: always start TxnCoordSender heartbeat loop #67215
Conversation
Force-pushed 780a7b1 to 151d782
Thanks for running these benchmarks. This is all very positive and indicates to me that this is the direction to move in. I am interested in exploring whether we can remove the HasTxnRecord check altogether, but either way, these benchmarks show that this is a viable path forward that avoids all of the client-side complexity that we discussed in #57042.
Out of everything here, I'm actually most surprised that the added goroutine per 1PC txn didn't have an impact on performance. What did you think of the proposal to avoid this in the common case with a time.AfterFunc call that's delayed by the first heartbeat interval? Timers are bound to a specific P in the Go runtime, so this should not create any new lock contention.
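For reference, a rough sketch of that proposal -- not code from this PR; h, ctx, and the helper names are placeholders for the heartbeater interceptor's actual wiring:

```go
// Hypothetical sketch: delay heartbeat-loop startup by one heartbeat
// interval so that 1PC txns which finish quickly never spawn a goroutine.
timer := time.AfterFunc(h.loopInterval, func() {
	h.startHeartbeatLoop(ctx) // placeholder for the real loop startup
})
h.mu.loopCancel = func() {
	// Stop returns false if the callback already fired, in which case the
	// running loop needs to be cancelled explicitly.
	if !timer.Stop() {
		h.cancelHeartbeatLoop() // placeholder
	}
}
```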
Reviewed 6 of 6 files at r1.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @andreimatei and @erikgrinaker)
pkg/kv/kvclient/kvcoord/txn_interceptor_heartbeater_test.go, line 241 at r1 (raw file):
```
etReq := ba.Requests[1].GetInner().(*roachpb.EndTxnRequest)
require.False(t, etReq.TxnHeartbeating)
```
Why remove these assertions if we're keeping the field?
pkg/kv/kvserver/replica_test.go, line 11269 at r1 (raw file):
```
et, etH := endTxnArgs(txn, true /* commit */)
et.InFlightWrites = inFlightWrites
et.TxnHeartbeating = true
```
Do we want to remove these?
pkg/kv/kvserver/replica_write.go, line 415 at r1 (raw file):
```
// The EndTxn checks whether the txn record can be created, but we're
// eliding the EndTxn. So, we'll do the check instead.
ok, minCommitTS, reason := r.CanCreateTxnRecord(ctx, ba.Txn.ID, ba.Txn.Key, ba.Txn.MinTimestamp)
```
I do still wonder what it would take to make a positive result from CanCreateTxnRecord imply a negative result from HasTxnRecord. We create transaction records in exactly two places, EndTxn and HeartbeatTxn. See the state machine diagram on CanCreateTxnRecord. So I think if we bumped the timestamp cache on HeartbeatTxn like we do for EndTxn in (*Replica).updateTimestampCache, then we may be able to avoid the HasTxnRecord check altogether.
I am interested in exploring whether we can remove the HasTxnRecord check altogether
I'll give this a try as a separate commit (or PR, if you'd prefer).
Out of everything here, I'm actually most surprised that the added goroutine per 1PC txn didn't have an impact on performance. What did you think of the proposal to avoid this in the common case with a time.AfterFunc call that's delayed by the first heartbeat interval? Timers are bound to a specific P in the Go runtime, so this should not create any new lock contention.
In the general case, a write transaction will involve at least a network hop and an fsync, and the latency of spawning a goroutine will be dwarfed by these. I suppose it could have an effect under very high concurrency or load, although I suspect even then it will be marginal compared to other costs -- in any case, we shouldn't optimize this until it's been demonstrated to be a problem (the nightly roachperf benchmarks will presumably reveal this if it's significant).
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @andreimatei and @nvanbenschoten)
pkg/kv/kvclient/kvcoord/txn_interceptor_heartbeater_test.go, line 241 at r1 (raw file):
Previously, nvanbenschoten (Nathan VanBenschoten) wrote…
Why remove these assertions if we're keeping the field?
Mostly because I started off by just removing the field outright, but you're right, we should retain assertions that we're setting them as necessary for 21.1 compatibility.
pkg/kv/kvserver/replica_test.go, line 11269 at r1 (raw file):
Previously, nvanbenschoten (Nathan VanBenschoten) wrote…
Do we want to remove these?
Yeah, I think these are fine to remove since the code doesn't care about the field at all.
pkg/kv/kvserver/replica_write.go, line 415 at r1 (raw file):
Previously, nvanbenschoten (Nathan VanBenschoten) wrote…
I do still wonder what it would take to make a positive result from CanCreateTxnRecord imply a negative result from HasTxnRecord. We create transaction records in exactly two places, EndTxn and HeartbeatTxn. See the state machine diagram on CanCreateTxnRecord. So I think if we bumped the timestamp cache on HeartbeatTxn like we do for EndTxn in (*Replica).updateTimestampCache, then we may be able to avoid the HasTxnRecord check altogether.
I'll have a look!
Force-pushed 40c45ac to f7ded0b
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @andreimatei and @nvanbenschoten)
pkg/kv/kvserver/replica_write.go, line 415 at r1 (raw file):
Previously, erikgrinaker (Erik Grinaker) wrote…
I'll have a look!
I need to read up on the timestamp cache and the surrounding code, but I added a naïve proof of concept as a separate commit -- is something like this what you had in mind? It's unclear to me whether this is safe if the entry gets evicted from the cache, although I suppose in the best case we just leak a transaction record (assuming there is no possible scenario in which the txn could also have laid down an intent).
In any case, I ran a quick benchmark with this commit and it doesn't appear to have a significant effect, so I do wonder whether it's even worth it.
```
name                          old ops/sec  new ops/sec  delta
kv0/enc=false/nodes=1/cpu=32  23.2k ±12%   24.0k ± 4%   ~  (p=0.497 n=10+9)

name                          old p50      new p50      delta
kv0/enc=false/nodes=1/cpu=32  2.52 ± 3%    2.50 ± 0%    ~  (p=0.529 n=8+9)

name                          old p95      new p95      delta
kv0/enc=false/nodes=1/cpu=32  5.21 ±15%    5.00 ± 6%    ~  (p=0.513 n=10+9)

name                          old p99      new p99      delta
kv0/enc=false/nodes=1/cpu=32  7.78 ±21%    7.21 ± 6%    ~  (p=0.163 n=10+9)
```
I'm off next week, but let me know if you'd like me to merge the initial commit here and revisit the timestamp cache separately.
Out of everything here, I'm actually most surprised that the added goroutine per 1PC txn didn't have an impact on performance. What did you think of the proposal to avoid this in the common case with a time.AfterFunc call that's delayed by the first heartbeat interval? Timers are bound to a specific P in the Go runtime, so this should not create any new lock contention.
In the general case, a write transaction will involve at least a network hop and an fsync, and the latency of spawning a goroutine will be dwarfed by these. I suppose it could have an effect under very high concurrency or load, although I suspect even then it will be marginal compared to other costs -- in any case, we shouldn't optimize this until it's been demonstrated to be a problem (the nightly roachperf benchmarks will presumably reveal this if it's significant).
It's not about the latency of spawning a goroutine, I think, but about the impact on throughput. The kv0 benchmark you ran pushed the throughput to saturation?
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @erikgrinaker and @nvanbenschoten)
pkg/kv/kvserver/replica_tscache.go, line 124 at r3 (raw file):
```
}
case *roachpb.HeartbeatTxnRequest:
	key := transactionRecordMarker(start, txnID)
```
I think Nathan had in mind using TransactionTombstoneMarker here.
pkg/kv/kvserver/replica_tscache.go, line 510 at r3 (raw file):
```
_, recordTxnID := r.store.tsCache.GetMax(recordKey, nil /* end */)
if (recordTxnID != uuid.UUID{}) {
	return false, minCommitTS, 0
```
please use a new reason
pkg/kv/kvserver/replica_tscache.go, line 557 at r3 (raw file):
} }
spurious line?
pkg/kv/kvserver/replica_write.go, line 415 at r1 (raw file):
It's unclear to me whether this is safe if the entry gets evicted from the cache, although I suppose in the best case we just leak a transaction record
I think maybe there's some confusion about how the tscache works. Entries don't simply get evicted from it. Instead, the cache ratchets up the timestamp it tracks over whole key spans, thereby losing precision. The point is, once a key is tracked at ts1, the tscache will always return a ts >= ts1 for it.
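To illustrate the ratcheting behavior described here, a standalone toy model -- this is not the real interval-based tscache, just the semantics relevant to this thread:

```go
package main

import "fmt"

// entry records the last timestamp (and owning txn) seen for a key.
type entry struct {
	ts    int64
	txnID string
}

// toyTSCache mimics the property described above: reads never observe a
// timestamp below one that was previously recorded, but precision (here,
// the txn ID) is lost once the low water mark overtakes an entry.
type toyTSCache struct {
	lowWater int64
	entries  map[string]entry
}

func (c *toyTSCache) GetMax(key string) (int64, string) {
	if e, ok := c.entries[key]; ok && e.ts > c.lowWater {
		return e.ts, e.txnID
	}
	return c.lowWater, "" // still a valid lower bound, but anonymous
}

func main() {
	c := &toyTSCache{lowWater: 5, entries: map[string]entry{"k": {ts: 10, txnID: "txn1"}}}
	fmt.Println(c.GetMax("k")) // 10 txn1
	c.lowWater = 20            // ratcheted up; per-key precision lost
	fmt.Println(c.GetMax("k")) // 20 and an empty txn ID -- never below the earlier 10
}
```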
It's not about the latency of spawning a goroutine, I think, but about the impact on throughput. The kv0 benchmark you ran pushed the throughput to saturation?
I believe so, yeah -- it used the default workload concurrency of 48 -- but I can do some more benchmarks to make sure once we finalize the design here.
Since these goroutines basically immediately block on a timer, I believe the only way they can have a significant effect on throughput is by putting pressure on the Go scheduler -- this is what I alluded to in my comment, as that'll only happen under very high concurrency or load. I don't really see that being a problem until we're beyond tens of thousands of concurrent goroutines/transactions, and at that scale I suspect other factors will dominate throughput. But it's worth running some benchmarks at very high concurrency to see what effect it has, if any.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @andreimatei, @erikgrinaker, and @nvanbenschoten)
pkg/kv/kvserver/replica_tscache.go, line 124 at r3 (raw file):
Previously, andreimatei (Andrei Matei) wrote…
I think Nathan had in mind using TransactionTombstoneMarker here.
We need to distinguish between a transaction that has already been finalized (i.e. ABORT_REASON_ALREADY_COMMITTED_OR_ROLLED_BACK_POSSIBLE_REPLAY) and one that has written a non-finalized record. In canAttempt1PCEvaluation, the first case should return a txn error while the second case shouldn't (it should instead take the 2PC branch). I don't think this is possible using transactionTombstoneMarker alone. We could change the error code to e.g. ABORT_REASON_RECORD_EXISTS, and then have callers check the txn status, take appropriate action, and convert the code to ABORT_REASON_ALREADY_COMMITTED_OR_ROLLED_BACK_POSSIBLE_REPLAY before propagating it to clients (to preserve API backwards compatibility). However, this feels convoluted and error-prone.
pkg/kv/kvserver/replica_tscache.go, line 510 at r3 (raw file):
Previously, andreimatei (Andrei Matei) wrote…
please use a new reason
Yep, this is just a placeholder until I figure out how we want to solve this.
pkg/kv/kvserver/replica_tscache.go, line 557 at r3 (raw file):
Previously, andreimatei (Andrei Matei) wrote…
spurious line?
Yup, fixed.
pkg/kv/kvserver/replica_write.go, line 415 at r1 (raw file):
Previously, andreimatei (Andrei Matei) wrote…
It's unclear to me whether this is safe if the entry gets evicted from the cache, although I suppose in the best case we just leak a transaction record
I think maybe there's some confusion about how the tscache works. Entries don't simply get evicted from it. Instead, the cache ratchets up the timestamp it tracks over whole key spans, thereby losing precision. The point is, once a key is tracked at ts1, the tscache will always return a ts >= ts1 for it.
No, I get that. But once the low water mark passes this txn's entry timestamp, I believe GetMax() will no longer return the txn ID, and so we can't know whether that entry ever existed. I thought that would, for our purposes, be equivalent to it getting evicted from the cache -- but I guess we can at least check the txn timestamp against the low water mark and know whether it might have been evicted. In that case I guess we'll want to do something similar to tombstones, i.e. return an ABORT_REASON_TIMESTAMP_CACHE_REJECTED error to the client and retry the txn.
Force-pushed 187cf47 to 10b98c9
It's not about the latency of spawning a goroutine, I think, but about the impact on throughput. The kv0 benchmark you ran pushed the throughput to saturation?
I believe so, yeah -- it used the default workload concurrency of 48 -- but I can do some more benchmarks to make sure once we finalize the design here.
Since these goroutines basically immediately block on a timer, I believe the only way they can have a significant effect on throughput is by putting pressure on the Go scheduler -- this is what I alluded to in my comment, as that'll only happen under very high concurrency or load. I don't really see that being a problem until we're beyond tens of thousands of concurrent goroutines/transactions, and at that scale I suspect other factors will dominate throughput. But it's worth running some benchmarks at very high concurrency to see what effect it has, if any.
Like Nathan was saying, always starting hb goroutines should double the goroutine creation rate in some workloads. I would expect that to have some effect - for example because of allocating all the stacks. So, yeah, I would push harder on benchmarking (or just use the timer). A concurrency of 48 is not that high; depending on what machine you've tested against, I doubt that it saturates the CPU. Also make sure you're running kv0 with a uniform distribution across sufficiently many ranges.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @andreimatei, @erikgrinaker, and @nvanbenschoten)
pkg/kv/kvserver/replica_tscache.go, line 124 at r3 (raw file):
In canAttempt1PCEvaluation, the first case should return a txn error while the second case shouldn't (it should instead take the 2PC branch)
canAttempt1PCEvaluation doesn't necessarily have to return an error in any case. It can just ignore the reason returned by CanCreateTxnRecord and return false, thus triggering the fallback to EndTxn evaluation (which reads the txn record from storage instead of relying on the ts cache to tell it anything). In my opinion, that'd be better than introducing a 3rd tscache key dealing with txn records. We would, however, need to update the comments on the transactionTombstoneMarker() to reflect the fact that it's only meaningful when there's no transaction record.
pkg/kv/kvserver/replica_write.go, line 415 at r1 (raw file):
I thought that would, for our purposes, be equivalent to it getting evicted from the cache
It's not equivalent. If the tombstone marker was simply never in the cache, then CanCreateTxnRecord might say true. If it ever was, then it will always say false (with either ABORT_REASON_ALREADY_COMMITTED_OR_ROLLED_BACK_POSSIBLE_REPLAY or ABORT_REASON_TIMESTAMP_CACHE_REJECTED).
So what I think we should do is take CanCreateTxnRecord() == false to mean we fall back to EndTxn evaluation.
Makes sense?
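A sketch of that fallback, reusing the CanCreateTxnRecord call quoted at the top of this thread -- the surrounding 1PC checks are elided and the return shape is simplified, not taken from the PR:

```go
// In canAttempt1PCEvaluation: if a record may already exist, simply decline
// the 1PC fast path instead of synthesizing an error; EndTxn evaluation will
// read the record from storage and handle it.
ok, _, _ := r.CanCreateTxnRecord(ctx, ba.Txn.ID, ba.Txn.Key, ba.Txn.MinTimestamp)
if !ok {
	return false // fall back to regular EndTxn evaluation
}
return true
```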
But it's worth running some benchmarks at very high concurrency to see what effect it has, if any.
A concurrency of 48 is not that high; depending on what machine you've tested against
Agreed. I imagine we'll start seeing a more pronounced effect, if one exists, on larger machines. Have you tried running kv0 on a 32 vCPU machine with 1x replication? I'd imagine the most comprehensive trial would be to test with three variations on that configuration: first with what you have here, second with a timer to defer goroutine creation, and third without a heartbeat loop at all. Also, we don't necessarily need to make this change in this PR, but I would encourage you to keep pulling on this thread. Otherwise, it's going to be a while until we return to this.
Reviewed 1 of 1 files at r2, 3 of 3 files at r4, 3 of 3 files at r5.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @andreimatei)
pkg/kv/kvserver/replica_tscache.go, line 124 at r3 (raw file):
canAttempt1PCEvaluation doesn't necessarily have to return an error in any case. It can just ignore the reason returned by CanCreateTxnRecord and return false, thus triggering the fallback to EndTxn evaluation (which reads the txn record from storage instead of relying on the ts cache to tell it anything). In my opinion, that'd be better than introducing a 3rd tscache key dealing with txn records. We would, however, need to update the comments on the transactionTombstoneMarker() to reflect the fact that it's only meaningful when there's no transaction record.
Yes, I agree with both of the points here. Instead of introducing a new key, I think it would be better to fall back to EndTxn evaluation (return false from canAttempt1PCEvaluation) when CanCreateTxnRecord returns false.
Force-pushed 02bca19 to 489edcf
always starting hb goroutines should double the goroutine creation rate in some workloads. I would expect that to have some effect - for example because of allocating all the stacks. So, yeah, I would push harder on benchmarking (or just use the timer). A concurrency of 48 is not that high; depending on what machine you've tested against, I doubt that it saturates the CPU. Also make sure you're running kv0 with a uniform distribution across sufficiently many ranges.
Agreed. I imagine we'll start seeing a more pronounced effect, if one exists, on larger machines. Have you tried running kv0 on a 32 vCPU machine with 1x replication? I'd imagine the most comprehensive trial would be to test with three variations on that configuration: first with what you have here, second with a timer to defer goroutine creation, and third without a heartbeat loop at all.
You're right, deferring the heartbeat loop with time.AfterFunc() increases throughput by about 3% when running 4096 concurrent kv0 writers on a single 32-CPU node with 1000 ranges. I've run a bunch of benchmarks here, choice picks:
old=always goroutine + tscache, new=afterfunc + tscache
```
name                                     old ops/sec  new ops/sec  delta
kv0/enc=false/nodes=1/cpu=32/conc=4096   42.1k ± 1%   43.5k ± 1%   +3.32%  (p=0.000 n=9+10)

name                                     old p50      new p50      delta
kv0/enc=false/nodes=1/cpu=32/conc=4096   96.5 ± 0%    92.3 ± 0%    -4.35%  (p=0.000 n=8+8)

name                                     old p95      new p95      delta
kv0/enc=false/nodes=1/cpu=32/conc=4096   168 ± 0%     176 ± 0%     +5.01%  (p=0.000 n=10+10)

name                                     old p99      new p99      delta
kv0/enc=false/nodes=1/cpu=32/conc=4096   193 ± 0%     199 ± 3%     +3.05%  (p=0.005 n=9+10)
```

old=afterfunc + tscache, new=no heartbeat loop + tscache
```
name                                     old ops/sec  new ops/sec  delta
kv0/enc=false/nodes=1/cpu=32/conc=4096   43.5k ± 1%   43.6k ± 2%   ~       (p=0.274 n=10+8)

name                                     old p50      new p50      delta
kv0/enc=false/nodes=1/cpu=32/conc=4096   92.3 ± 0%    92.8 ± 4%    ~       (p=1.000 n=8+8)

name                                     old p95      new p95      delta
kv0/enc=false/nodes=1/cpu=32/conc=4096   176 ± 0%     176 ± 0%     ~       (all equal)

name                                     old p99      new p99      delta
kv0/enc=false/nodes=1/cpu=32/conc=4096   199 ± 3%     195 ± 3%     ~       (p=0.153 n=10+8)
```

old=afterfunc + HasTxnRecord, new=afterfunc + tscache
```
name                                     old ops/sec  new ops/sec  delta
kv0/enc=false/nodes=1/cpu=32/conc=4096   42.1k ± 1%   43.4k ± 1%   +2.90%  (p=0.000 n=10+9)

name                                     old p50      new p50      delta
kv0/enc=false/nodes=1/cpu=32/conc=4096   96.5 ± 0%    92.3 ± 0%    -4.35%  (p=0.000 n=10+8)

name                                     old p95      new p95      delta
kv0/enc=false/nodes=1/cpu=32/conc=4096   181 ± 3%     176 ± 0%     -2.75%  (p=0.006 n=10+9)

name                                     old p99      new p99      delta
kv0/enc=false/nodes=1/cpu=32/conc=4096   201 ± 0%     201 ± 0%     ~       (all equal)
```

old=master, new=afterfunc + tscache
```
name                                     old ops/sec  new ops/sec  delta
kv0/enc=false/nodes=1/cpu=32/conc=4096   43.8k ± 2%   43.4k ± 1%   -0.99%  (p=0.015 n=8+9)

name                                     old p50      new p50      delta
kv0/enc=false/nodes=1/cpu=32/conc=4096   92.3 ± 0%    92.3 ± 0%    ~       (all equal)

name                                     old p95      new p95      delta
kv0/enc=false/nodes=1/cpu=32/conc=4096   176 ± 0%     176 ± 0%     ~       (all equal)

name                                     old p99      new p99      delta
kv0/enc=false/nodes=1/cpu=32/conc=4096   193 ± 0%     201 ± 0%     +4.35%  (p=0.000 n=8+9)
```
I've updated the PR with time.AfterFunc and using the tscache tombstone markers to check records, should be good to review (barring any test failures). Thanks for pushing on this, I'm just wary of trading complexity for performance without measuring it first.
time.AfterFunc will end up spawning two goroutines (one for the AfterFunc and another for the nested RunAsyncTask), which may explain some of the increased p99 latencies and the slight throughput reduction from master. I'm guessing we want to keep using RunAsyncTask since we enforce its use, but let me know if you think the bare AfterFunc goroutine will do here.
Dismissed @andreimatei from a discussion.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @andreimatei and @nvanbenschoten)
pkg/kv/kvserver/replica_tscache.go, line 124 at r3 (raw file):
Previously, nvanbenschoten (Nathan VanBenschoten) wrote…
canAttempt1PCEvaluation doesn't necessarily have to return an error in any case. It can just ignore the reason returned by CanCreateTxnRecord and return false, thus triggering the fallback to EndTxn evaluation (which reads the txn record from storage instead of relying on the ts cache to tell it anything). In my opinion, that'd be better than introducing a 3rd tscache key dealing with txn records. We would, however, need to update the comments on the transactionTombstoneMarker() to reflect the fact that it's only meaningful when there's no transaction record.
Yes, I agree with both of the points here. Instead of introducing a new key, I think it would be better to fall back to EndTxn evaluation (return false from canAttempt1PCEvaluation) when CanCreateTxnRecord returns false.
True, we can defer error generation to EndTxn -- I've implemented this. I'm not thrilled about the semantic complexity of tombstone markers also acting as markers for existing (non-deleted) records, but it gives the right result at least. We may want to consider renaming the markers, although it's not clear how to do this consistently without also changing ABORT_REASON_ALREADY_COMMITTED_OR_ROLLED_BACK_POSSIBLE_REPLAY.
pkg/kv/kvserver/replica_tscache.go, line 557 at r3 (raw file):
Previously, erikgrinaker (Erik Grinaker) wrote…
Yup, fixed.
Done.
pkg/kv/kvserver/replica_write.go, line 415 at r1 (raw file):
It's not equivalent. If the tombstone marker was simply never in the cache, then CanCreateTxnRecord might say true. If it ever was, then it will always say false (with either ABORT_REASON_ALREADY_COMMITTED_OR_ROLLED_BACK_POSSIBLE_REPLAY or ABORT_REASON_TIMESTAMP_CACHE_REJECTED).
Yeah, but that's only because a transaction timestamp below the low water mark will never be allowed to write a transaction record, right? The bit I missed was that the caller can use the timestamps to know whether a "negative" (i.e. empty txn ID) is definitely true or possibly false. The existing behavior of CanCreateTxnRecord makes sense to me, and serves as a good illustration.
Force-pushed f598180 to 3483057
time.AfterFunc will end up spawning two goroutines (one for the AfterFunc and another for the nested RunAsyncTask), which may explain some of the increased p99 latencies and the slight throughput reduction from master.
This obviously isn't the case, since the p99 latencies are well below 1 second.
Anyway, I did another benchmark run with the latest PR code but always spawning a goroutine, to double-check that I hadn't gotten anything mixed up, which yielded:
```
name                                     old ops/sec  new ops/sec  delta
kv0/enc=false/nodes=1/cpu=32/conc=4096   43.2k ± 0%   42.0k ± 1%   -2.79%  (p=0.000 n=9+10)

name                                     old p50      new p50      delta
kv0/enc=false/nodes=1/cpu=32/conc=4096   93.6 ± 3%    98.6 ± 2%    +5.39%  (p=0.001 n=10+10)

name                                     old p95      new p95      delta
kv0/enc=false/nodes=1/cpu=32/conc=4096   176 ± 0%     168 ± 0%     -4.77%  (p=0.000 n=10+10)

name                                     old p99      new p99      delta
kv0/enc=false/nodes=1/cpu=32/conc=4096   201 ± 0%     193 ± 0%     -4.17%  (p=0.000 n=10+9)
```
The tail latencies here are back down to the master baseline, so it seems like they can be explained by time.AfterFunc -- I think this is a reasonable tradeoff. Throughput reduction is consistent with previous results.
As for the ~1% throughput reduction we saw before when comparing master with this PR, I suspect that's attributable to the additional processing on the hot path: the heartbeat loop preparation and tscache lookup. Since this only seems to have an effect under high concurrency when the node is overloaded, and not during nominal loads, I think that's probably OK; wdyt?
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @andreimatei, @erikgrinaker, and @nvanbenschoten)
Force-pushed 3483057 to 531864c
time.AfterFunc will end up spawning two goroutines (one for the AfterFunc and another for the nested RunAsyncTask), which may explain some of the increased p99 latencies and the slight throughput reduction from master. I'm guessing we want to keep using RunAsyncTask since we enforce its use, but let me know if you think the bare AfterFunc goroutine will do here.
If time.AfterFunc's callback runs in a new goroutine, then I don't see a strong reason to use RunAsyncTask. Do you? We could use RunTask to ensure that the goroutine is tracked in the Stopper without the extra hop.
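A sketch of what that would look like -- hbCtx, taskName, and h.heartbeatLoop as in the snippets quoted later in this review; error handling trimmed:

```go
// time.AfterFunc already runs the callback on its own goroutine, so RunTask
// (which runs the closure synchronously) registers it with the Stopper
// without adding a second goroutine hop.
timer := time.AfterFunc(h.loopInterval, func() {
	if err := h.stopper.RunTask(hbCtx, taskName, h.heartbeatLoop); err != nil {
		// The stopper is quiescing; the txn simply runs without a heartbeat loop.
	}
})
```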
Reviewed 1 of 4 files at r8, 2 of 4 files at r9.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @andreimatei, @erikgrinaker, and @nvanbenschoten)
pkg/kv/kvclient/kvcoord/txn_interceptor_heartbeater.go, line 311 at r9 (raw file):
```
// If we fail to kick off the heartbeat loop, any changes to the txn
// record will be detected on commit or rollback, so we just log it.
log.VEventf(ctx, 1, "Failed to start txn heartbeat loop: %v", err)
```
nit: most log lines don't start with a capital letter.
pkg/kv/kvclient/kvcoord/txn_interceptor_heartbeater.go, line 316 at r9 (raw file):
```
h.mu.loopCancel = func() {
	if !timer.Stop() {
```
Any reason not to call hbCancel either way?
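For comparison, the "always cancel" variant that the reply below says was adopted -- hbCancel is assumed to be the heartbeat context's CancelFunc:

```go
h.mu.loopCancel = func() {
	timer.Stop() // best effort; a no-op if the callback has already run
	hbCancel()   // cancel unconditionally, whether or not the loop started
}
```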
pkg/kv/kvserver/replica_tscache.go, line 124 at r3 (raw file):
Previously, erikgrinaker (Erik Grinaker) wrote…
True, we can defer error generation to EndTxn -- I've implemented this. I'm not thrilled about the semantic complexity of tombstone markers also acting as markers for existing (non-deleted) records, but it gives the right result at least. We may want to consider renaming the markers, although it's not clear how to do this consistently without also changing ABORT_REASON_ALREADY_COMMITTED_OR_ROLLED_BACK_POSSIBLE_REPLAY.
I agree with your hesitation towards adding in a "special case" here. A lot of these "the marker is also overloaded" comments are quite unfortunate and we should try to avoid them.
But I don't necessarily think this is overloading the meaning. In its strictest form, a tombstone is written when removing an object to prevent it from being recreated. But in practice, it's often easier to create the tombstone at creation time instead. #64966 is another example of that. As it turns out, even EndTxn's handling here is an example of that, as we don't necessarily GC the transaction record during an EndTxn, we just mark it as finalized.
So I'd see how far you can get in updating these interactions without considering this a special case. If that means you need to change the name of ABORT_REASON_ALREADY_COMMITTED_OR_ROLLED_BACK_POSSIBLE_REPLAY, that seems fine.
pkg/kv/kvserver/replica_tscache.go, line 119 at r9 (raw file):
```
// transaction's MinTimestamp, which is consulted in
// CanCreateTxnRecord.
if br.Txn.Status.IsFinalized() {
```
Related to the discussion below, I think we'll also want to remove this condition and write the tombstone marker even if the EndTxn has created our txn record with a STAGING status.
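Roughly, the suggestion is to drop the finalized-status gate in this case of (*Replica).updateTimestampCache; the helper call below is schematic (addToTSCache and the exact arguments are assumptions, not taken from the PR):

```go
case *roachpb.EndTxnRequest:
	// Write the tombstone marker whenever EndTxn wrote the record, including
	// STAGING records, rather than only for finalized statuses.
	key := transactionTombstoneMarker(start, txnID)
	addToTSCache(key, nil, ts, txnID) // schematic; previously gated on br.Txn.Status.IsFinalized()
```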
pkg/kv/kvserver/replica_write.go, line 394 at r9 (raw file):
```
func (r *Replica) canAttempt1PCEvaluation(
	ctx context.Context, ba *roachpb.BatchRequest, latchSpans *spanset.SpanSet,
) (bool, *roachpb.Error) {
```
Will we ever return an error from this method anymore? If not, let's remove the error return value.
pkg/kv/kvserver/replica_write.go, line 414 at r9 (raw file):
2PC evaluation
I don't know that we would necessarily consider the non-1PC fast-path to be "2 phase-commit". Consider just saying "non-1PC"
pkg/roachpb/api.go, line 1217 at r9 (raw file):
```
// HeartbeatTxn updates the timestamp cache with transaction records,
// to avoid checking for them on disk when considering 1PC evaluation.
```
nit: this makes it sound like HeartbeatTxnRequest is the one checking for them. Consider rewording to "to avoid the need to check for them on disk when considering 1PC evaluation".
Force-pushed 531864c to 87d6456
If time.AfterFunc's callback runs in a new goroutine, then I don't see a strong reason to use RunAsyncTask. Do you? We could use RunTask to ensure that the goroutine is tracked in the Stopper without the extra hop.
Not at all, forgot about RunTask. Updated.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @andreimatei, @erikgrinaker, and @nvanbenschoten)
pkg/kv/kvclient/kvcoord/txn_interceptor_heartbeater.go, line 316 at r9 (raw file):
Previously, nvanbenschoten (Nathan VanBenschoten) wrote…
Any reason not to call hbCancel either way?
I could go either way, just figured we'd avoid a channel close which I'd expect to involve some sort of CPU synchronization instruction. Not likely to matter much either way, so updated it to always cancel.
pkg/kv/kvserver/replica_tscache.go, line 124 at r3 (raw file):
Previously, nvanbenschoten (Nathan VanBenschoten) wrote…
I agree with your hesitation towards adding in a "special case" here. A lot of these "the marker is also overloaded" comments are quite unfortunate and we should try to avoid them.
But I don't necessarily think this is overloading the meaning. In its strictest form, a tombstone is written when removing an object to prevent it from being recreated. But in practice, it's often easier to create the tombstone at creation time instead. #64966 is another example of that. As it turns out, even EndTxn's handling here is an example of that, as we don't necessarily GC the transaction record during an EndTxn, we just mark it as finalized.
So I'd see how far you can get in updating these interactions without considering this a special case. If that means you need to change the name of ABORT_REASON_ALREADY_COMMITTED_OR_ROLLED_BACK_POSSIBLE_REPLAY, that seems fine.
I suppose that's true, if we squint a little. Changed the reason to ABORT_REASON_RECORD_ALREADY_WRITTEN_POSSIBLE_REPLAY. Not sure to what extent we try to maintain backwards compatibility; I suppose it's fine for the next major release as long as it's called out in the release notes.
pkg/kv/kvserver/replica_tscache.go, line 119 at r9 (raw file):
Previously, nvanbenschoten (Nathan VanBenschoten) wrote…
Related to the discussion below, I think we'll also want to remove this condition and write the tombstone marker even if the EndTxn has created our txn record with a STAGING status.
Good catch, done.
pkg/kv/kvserver/replica_write.go, line 394 at r9 (raw file):
Previously, nvanbenschoten (Nathan VanBenschoten) wrote…
Will we ever return an error from this method anymore? If not, let's remove the error return value.
Good catch, done.
mod the tracing questions. Nice job on this.
Reviewed 1 of 5 files at r10, 6 of 6 files at r11.
Reviewable status: complete! 1 of 0 LGTMs obtained (waiting on @andreimatei and @erikgrinaker)
pkg/kv/kvclient/kvcoord/txn_interceptor_heartbeater.go, line 305 at r11 (raw file):
```
// We want the loop to run in a span linked to the current one, so we put
// our span in the context and fork it.
hbCtx = tracing.ContextWithSpan(hbCtx, tracing.SpanFromContext(ctx))
```
Is the use of ctx here racy? We're going to call h.mu.loopCancel on the way back through this interceptor if we're planning on aborting and finishing the span, so we'll call timer.Stop before any span in the context is finished, but that doesn't mean that this timer won't fire. So it's possible for this logic to outlive the tracing span in its context, right? Will that cause issues?
cc. @andreimatei I don't have a good mental model of how this works, other than that I know I've been burnt by it before. Any thoughts?
pkg/kv/kvclient/kvcoord/txn_interceptor_heartbeater.go, line 306 at r11 (raw file):
```
// our span in the context and fork it.
hbCtx = tracing.ContextWithSpan(hbCtx, tracing.SpanFromContext(ctx))
hbCtx, _ = tracing.ForkSpan(hbCtx, taskName)
```
Do we need to .Finish() this tracing span at some point?
Force-pushed 87d6456 to 8e2150d
Reviewable status: complete! 0 of 0 LGTMs obtained (and 1 stale) (waiting on @andreimatei and @nvanbenschoten)
pkg/kv/kvclient/kvcoord/txn_interceptor_heartbeater.go, line 305 at r11 (raw file):
Previously, nvanbenschoten (Nathan VanBenschoten) wrote…
Is the use of ctx here racy? We're going to call h.mu.loopCancel on the way back through this interceptor if we're planning on aborting and finishing the span, so we'll call timer.Stop before any span in the context is finished, but that doesn't mean that this timer won't fire. So it's possible for this logic to outlive the tracing span in its context, right? Will that cause issues?
cc. @andreimatei I don't have a good mental model of how this works, other than that I know I've been burnt by it before. Any thoughts?
Contexts are generally expected to be immutable, so any modification to a context (and its contained values) should be done by creating a new context. I have no idea if the tracing stuff follows those conventions though.
I updated the code to set up the tracing span synchronously, out of caution. The span forking was previously done on the async goroutine, so I'll do a quick benchmark run and see how it affects performance.
pkg/kv/kvclient/kvcoord/txn_interceptor_heartbeater.go, line 306 at r11 (raw file):
Previously, nvanbenschoten (Nathan VanBenschoten) wrote…
Do we need to .Finish() this tracing span at some point?
We do, nice catch.
Reviewable status: complete! 0 of 0 LGTMs obtained (and 1 stale) (waiting on @andreimatei and @nvanbenschoten)
pkg/kv/kvclient/kvcoord/txn_interceptor_heartbeater.go, line 305 at r11 (raw file):
will do a quick benchmark run and see how it affects performance.
No effect.
Reviewable status: complete! 0 of 0 LGTMs obtained (and 1 stale) (waiting on @andreimatei, @erikgrinaker, and @nvanbenschoten)
pkg/kv/kvclient/kvcoord/txn_interceptor_heartbeater.go, line 308 at r13 (raw file):
```
timer := time.AfterFunc(h.loopInterval, func() {
	if err := h.stopper.RunTask(hbCtx, taskName, h.heartbeatLoop); err != nil {
		// If we fail to kick off the heartbeat loop, any changes to the txn
```
I think few people would understand what this comment is telling them. I don't necessarily understand it myself. I'd remove it; I don't feel a need for any comment here.
I'd also get rid of the logging or, if I keep it, I'd put it lower than level 1 (at least the more usual 2). RunTask fails only if the stopper is quiescing; I don't think that deserves any logging.
pkg/kv/kvclient/kvcoord/txn_interceptor_heartbeater.go, line 310 at r13 (raw file):
```
// If we fail to kick off the heartbeat loop, any changes to the txn
// record will be detected on commit or rollback, so we just log it.
log.VEventf(ctx, 1, "failed to start txn heartbeat loop: %v", err)
```
nit: s/%v/%s. %v usually tells me that something unusual is going on - like for example that err can be nil
Reviewable status: complete! 0 of 0 LGTMs obtained (and 1 stale) (waiting on @andreimatei, @erikgrinaker, and @nvanbenschoten)
pkg/kv/kvclient/kvcoord/txn_interceptor_heartbeater.go, line 305 at r11 (raw file):
Previously, erikgrinaker (Erik Grinaker) wrote…
will do a quick benchmark run and see how it affects performance.
No effect.
was there a race in the hbCtx = tracing.ContextWithSpan(hbCtx, tracing.SpanFromContext(ctx)) line, regardless of whether timer.Stop was called or not? I don't see it.
I'd put the lines

```
hbCtx = tracing.ContextWithSpan(hbCtx, tracing.SpanFromContext(ctx))
hbCtx, span := tracing.ForkSpan(hbCtx, taskName)
```

back into the timer func. I'd also put the span.Stop() inside the timer func. That way, you don't open the door to a use-after-finish in the VEventf() when loopCancel is called. We've made our spans resilient to this kind of use-after-finish, but it's still not nice.
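A sketch of that arrangement -- the base context setup and error handling are elided, and span.Finish() stands in for the "finish it when done" step discussed earlier:

```go
timer := time.AfterFunc(h.loopInterval, func() {
	// Fork the span inside the callback so its lifetime stays confined to
	// the heartbeat goroutine, and finish it once the loop returns.
	hbCtx := tracing.ContextWithSpan(baseCtx, tracing.SpanFromContext(ctx)) // baseCtx: setup elided
	hbCtx, span := tracing.ForkSpan(hbCtx, taskName)
	defer span.Finish()
	_ = h.stopper.RunTask(hbCtx, taskName, h.heartbeatLoop)
})
```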
Force-pushed 8e2150d to 42ffe12
Reviewable status: complete! 0 of 0 LGTMs obtained (and 1 stale) (waiting on @andreimatei and @nvanbenschoten)
pkg/kv/kvclient/kvcoord/txn_interceptor_heartbeater.go, line 305 at r11 (raw file):
Previously, andreimatei (Andrei Matei) wrote…
was there a race in the hbCtx = tracing.ContextWithSpan(hbCtx, tracing.SpanFromContext(ctx)) line, regardless of whether timer.Stop was called or not? I don't see it.

I'd put the lines

```
hbCtx = tracing.ContextWithSpan(hbCtx, tracing.SpanFromContext(ctx))
hbCtx, span := tracing.ForkSpan(hbCtx, taskName)
```

back into the timer func. I'd also put the span.Stop() inside the timer func. That way, you don't open the door to a use-after-finish in the VEventf() when loopCancel is called. We've made our spans resilient to this kind of use-after-finish, but it's still not nice.
Ok, done. It wasn't clear to me whether SpanFromContext() and ForkSpan() were concurrency-safe.
pkg/kv/kvclient/kvcoord/txn_interceptor_heartbeater.go, line 308 at r13 (raw file):
Previously, andreimatei (Andrei Matei) wrote…
I think few people would understand what this comment is telling them. I don't necessarily understand it myself. I'd remove it; I don't feel a need for any comment here.
I'd also get rid of the logging or, if I keep it, I'd put it lower than level 1 (at least the more usual 2). RunTask fails only if the stopper is quiescing; I don't think that deserves any logging.
Removed. The idea was to make it clear that it is safe to run the transaction without a heartbeat loop.
pkg/kv/kvclient/kvcoord/txn_interceptor_heartbeater.go, line 310 at r13 (raw file):
Previously, andreimatei (Andrei Matei) wrote…
nit: s/%v/%s. %v usually tells me that something unusual is going on - like for example that err can be nil
Updated.
Force-pushed 42ffe12 to b36b4c0
Previously, potential 1PC transactions (containing an `EndTxn` in the same batch as writes) did not start the `TxnCoordSender` heartbeat loop, and thus never wrote a transaction record. This was a performance optimization to avoid checking for an existing record during write evaluation. However, it could cause problems under heavy contention where such transactions would abort each other as they did not heartbeat their transaction record and thus would become invalid once the txn timeout expired. This would apply even to transactions that turned out to not use 1PC due to spanning multiple ranges, and thus are more vulnerable to contention.

However, benchmarks showed that this optimization did not actually have a noticeable effect on performance (new always heartbeats):

```
name                           old ops/sec  new ops/sec  delta
kv0/enc=false/nodes=1/cpu=32    24.5k ± 8%   24.7k ± 3%    ~     (p=0.796 n=9+9)
kv0/enc=false/nodes=3/cpu=32    36.7k ± 4%   36.5k ± 4%    ~     (p=0.529 n=10+10)
kv95/enc=false/nodes=3/cpu=32    124k ± 7%    131k ± 2%  +5.56%  (p=0.002 n=10+8)
ycsb/A/nodes=3/cpu=32           27.7k ± 4%   28.1k ± 4%    ~     (p=0.315 n=9+10)

name                           old p50      new p50      delta
kv0/enc=false/nodes=1/cpu=32     2.50 ± 0%    2.50 ± 0%    ~     (all equal)
kv0/enc=false/nodes=3/cpu=32     4.70 ± 0%    4.70 ± 0%    ~     (all equal)
kv95/enc=false/nodes=3/cpu=32    1.00 ± 0%    1.00 ± 0%    ~     (all equal)
ycsb/A/nodes=3/cpu=32            2.44 ± 2%    2.45 ± 2%    ~     (p=1.000 n=10+10)

name                           old p95      new p95      delta
kv0/enc=false/nodes=1/cpu=32     4.79 ± 9%    4.70 ± 0%    ~     (p=0.350 n=9+7)
kv0/enc=false/nodes=3/cpu=32     10.7 ± 3%    10.8 ± 7%    ~     (p=1.000 n=10+10)
kv95/enc=false/nodes=3/cpu=32    5.02 ±10%    4.70 ± 0%  -6.37%  (p=0.003 n=10+8)
ycsb/A/nodes=3/cpu=32            9.10 ± 8%    9.20 ± 3%    ~     (p=0.720 n=10+10)

name                           old p99      new p99      delta
kv0/enc=false/nodes=1/cpu=32     6.90 ±10%    6.87 ± 6%    ~     (p=1.000 n=9+9)
kv0/enc=false/nodes=3/cpu=32     14.4 ± 9%    14.4 ± 6%    ~     (p=0.707 n=10+9)
kv95/enc=false/nodes=3/cpu=32    9.79 ± 7%    9.21 ± 3%  -5.90%  (p=0.015 n=10+8)
ycsb/A/nodes=3/cpu=32            78.0 ± 9%    75.5 ± 6%    ~     (p=0.305 n=10+10)
```

This patch therefore always starts the heartbeat loop for all transactions, but only after the heartbeat interval (1 second) passes. In the typical 1PC case, we therefore do not spawn a goroutine nor write a transaction record anyway.

The `EndTxn.TxnHeartbeating` field is retained for compatibility with v21.1 nodes, otherwise these nodes would never check for txn records when evaluating possible 1PC writes.

Release note (performance improvement): Improved concurrency control for heavily contended write queries outside of transactions that touch multiple ranges, reducing excessive aborts and retries.
Potential 1PC transactions must currently check for an existing transaction record on disk. This patch changes the timestamp cache tombstone markers to be added when the record is first written rather than when it's finalized, which allows using the timestamp cache for this check, avoiding a disk read.

This also changes the name `ABORT_REASON_ALREADY_COMMITTED_OR_ROLLED_BACK_POSSIBLE_REPLAY` to `ABORT_REASON_RECORD_ALREADY_WRITTEN_POSSIBLE_REPLAY`, to reflect the new tombstone semantics. The Protobuf identifier is retained for wire compatibility.

Release note (api change): The transaction abort error reason `ABORT_REASON_ALREADY_COMMITTED_OR_ROLLED_BACK_POSSIBLE_REPLAY` has been renamed to `ABORT_REASON_RECORD_ALREADY_WRITTEN_POSSIBLE_REPLAY`.
Force-pushed b36b4c0 to 1a9a35c
Since @andreimatei is out on vacation I won't wait for his approval here. His comments have been addressed. TFTRs!

bors r=nvanbenschoten,andreimatei
Build succeeded:
kvcoord: always start TxnCoordSender heartbeat loop
Previously, potential 1PC transactions (containing an `EndTxn` in the same batch as writes) did not start the `TxnCoordSender` heartbeat loop, and thus never wrote a transaction record. This was a performance optimization to avoid checking for an existing record during write evaluation. However, it could cause problems under heavy contention where such transactions would abort each other as they did not heartbeat their transaction record and thus would become invalid once the txn timeout expired. This would apply even to transactions that turned out to not use 1PC due to spanning multiple ranges, and thus are more vulnerable to contention.

However, benchmarks showed that this optimization did not actually have a noticeable effect on performance (new always heartbeats).

This patch therefore always starts the heartbeat loop for all transactions, but only after the heartbeat interval (1 second) passes. In the typical 1PC case, we therefore do not spawn a goroutine nor write a transaction record anyway.

The `EndTxn.TxnHeartbeating` field is retained for compatibility with v21.1 nodes, otherwise these nodes would never check for txn records when evaluating possible 1PC writes.

Resolves #57042.

Release note (performance improvement): Improved concurrency control for heavily contended write queries outside of transactions that touch multiple ranges, reducing excessive aborts and retries.
kvserver: use timestamp cache to mark and check txn records
Potential 1PC transactions must currently check for an existing transaction record on disk. This patch changes the timestamp cache tombstone markers to be added when the record is first written rather than when it's finalized, which allows using the timestamp cache for this check, avoiding a disk read.

This also changes the name `ABORT_REASON_ALREADY_COMMITTED_OR_ROLLED_BACK_POSSIBLE_REPLAY` to `ABORT_REASON_RECORD_ALREADY_WRITTEN_POSSIBLE_REPLAY`, to reflect the new tombstone semantics. The Protobuf identifier is retained for wire compatibility.

Release note (api change): The transaction abort error reason `ABORT_REASON_ALREADY_COMMITTED_OR_ROLLED_BACK_POSSIBLE_REPLAY` has been renamed to `ABORT_REASON_RECORD_ALREADY_WRITTEN_POSSIBLE_REPLAY`.